The goal of this final project is to build the most accurate model for predicting a person’s sleep efficiency score. We will implement several machine learning techniques to solve this regression problem and, along the way, gain some new insights into improving our overall health through better-quality sleep.
Sleep is more important than many of us are willing to admit, especially as university students who are learning to balance living on our own for the first time. Personally, I feel like there aren’t enough hours in a day to balance classes, studying, exercise, socialization, and downtime. Instead, I turn to chugging caffeine to sustain long hours of cramming, and I dip into a few hours of my precious beauty sleep to decompress with friends or enjoy some screen time before bed. Most days, I wake up poorly rested and barely scrape through the duties of the day. I feel groggy and frustrated by my lethargy… especially if I got the recommended 7-9 hours the night before! I’m sure many of you can relate, whether you’re a student or not!
That’s why I investigated sleep efficiency (SE): a measure of the proportion of time spent in bed that we’re actually asleep. To be precise, we define “time in bed” as the period beginning when we first try to fall asleep and ending when we wake up for the final time. Scientists consider an SE score of 80% or above to be normal, and most young adults in good health exhibit an SE of at least 90%. Unfortunately, the research suggests that sleep efficiency only declines with age.
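To make the definition concrete, here’s a tiny hand computation of a sleep efficiency score (all numbers below are made up for illustration):

```r
# Hypothetical night: in bed from 23:30 to 07:30 (8 hours total),
# with 1.2 hours of accumulated wakefulness during the night
time_in_bed <- 8.0   # hours from first sleep attempt to final wake-up
time_awake  <- 1.2   # wakefulness accumulated during that window
time_asleep <- time_in_bed - time_awake

sleep_efficiency <- time_asleep / time_in_bed
round(sleep_efficiency, 2)  # 0.85 -- right at a "healthy" score
```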
Our aim is to predict the variable Sleep.efficiency: a
measure of the proportion of the time in bed spent asleep. Its values
may range from [0,1] although the lowest observed score within our data
is 0.5. Sleep efficiency is a critical aspect of gauging sleep quality
and overall well-being. This measure accounts for periods of wakefulness
and sleeplessness throughout the night. Our model will give us some insight into which variables improve or diminish our sleep efficiency, and to what degree. Sleep efficiency scores are also indicative of individuals’ overall health: the Hypersomnia Foundation claims that normal, healthy scores are 85% or higher.
For this Statistical Machine Learning project, we will be working
with data from the file "sleep_efficiency.csv". Our goal is
to build the best predictive model for the outcome
Sleep.efficiency, and this large task requires a detailed
step-by-step procedure. First, we’ll load our data and R packages, and try to gain an overall understanding of our raw data. Once we run a few
lines of code, we can clean our data by removing the variables we don’t
need and convert the format of our predictors so they’re ready for our
recipe. We’ll split our data into training/testing sets, create 10 folds
for our cross-validation, and code a universal recipe shared by all five of our
models: Linear Regression, Ridge Regression, K-Nearest Neighbors, Random
Forests, and Gradient Boosted Trees. Once we train these models on our
folded data, we can determine our best two models based on our chosen
performance metric. The final step is to fit these two best models to
our testing data to evaluate how well our best model
can predict sleep efficiency scores. Are you ready for it?
First and foremost, let’s load all the necessary packages and the raw sleep-efficiency data into our workspace:
# Loading required packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.0
## ✔ dials 1.2.0 ✔ tune 1.1.2
## ✔ infer 1.0.5 ✔ workflows 1.1.3
## ✔ modeldata 1.3.0 ✔ workflowsets 1.0.1
## ✔ parsnip 1.1.1 ✔ yardstick 1.3.0
## ✔ recipes 1.0.9
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/
library(kknn)
library(corrplot)
## corrplot 0.92 loaded
library(ggplot2)
library(ggthemes)
library(recipes)
library(parsnip)
library(ranger)
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loaded glmnet 4.1-8
library(vip)
##
## Attaching package: 'vip'
##
## The following object is masked from 'package:utils':
##
## vi
library(finalfit)
library(ISLR)
library(conflicted)
library(xgboost)
# Loading the CSV file
sleep <- read.csv("~/Desktop/sleep_efficiency.csv")
Before delving into any further processing, let’s begin our initial exploration of the raw data we obtained directly from our CSV file. It’s extremely unlikely that data sets, especially ones as large as this one, arrive already “clean” and immediately ready for analysis. We will check for missing (NA) values to tidy up and evaluate whether we have to convert some of our variables into factors. Manipulating and cleaning the data is an essential preparatory step for later data visualizations and analysis.
# Briefly overview raw data
head(sleep)
## ID Age Gender Bedtime Wakeup.time Sleep.duration
## 1 1 65 Female 2021-03-06 01:00:00 2021-03-06 07:00:00 6.0
## 2 2 69 Male 2021-12-05 02:00:00 2021-12-05 09:00:00 7.0
## 3 3 40 Female 2021-05-25 21:30:00 2021-05-25 05:30:00 8.0
## 4 4 40 Female 2021-11-03 02:30:00 2021-11-03 08:30:00 6.0
## 5 5 57 Male 2021-03-13 01:00:00 2021-03-13 09:00:00 8.0
## 6 6 36 Female 2021-07-01 21:00:00 2021-07-01 04:30:00 7.5
## Sleep.efficiency REM.sleep.percentage Deep.sleep.percentage
## 1 0.88 18 70
## 2 0.66 19 28
## 3 0.89 20 70
## 4 0.51 23 25
## 5 0.76 27 55
## 6 0.90 23 60
## Light.sleep.percentage Awakenings Caffeine.consumption Alcohol.consumption
## 1 12 0 0 0
## 2 53 3 0 3
## 3 10 1 0 0
## 4 52 3 50 5
## 5 18 3 0 3
## 6 17 0 NA 0
## Smoking.status Exercise.frequency
## 1 Yes 3
## 2 Yes 3
## 3 No 3
## 4 Yes 1
## 5 No 3
## 6 No 1
summary(sleep)
## ID Age Gender Bedtime
## Min. : 1.0 Min. : 9.00 Length:452 Length:452
## 1st Qu.:113.8 1st Qu.:29.00 Class :character Class :character
## Median :226.5 Median :40.00 Mode :character Mode :character
## Mean :226.5 Mean :40.29
## 3rd Qu.:339.2 3rd Qu.:52.00
## Max. :452.0 Max. :69.00
##
## Wakeup.time Sleep.duration Sleep.efficiency REM.sleep.percentage
## Length:452 Min. : 5.000 Min. :0.5000 Min. :15.00
## Class :character 1st Qu.: 7.000 1st Qu.:0.6975 1st Qu.:20.00
## Mode :character Median : 7.500 Median :0.8200 Median :22.00
## Mean : 7.466 Mean :0.7889 Mean :22.62
## 3rd Qu.: 8.000 3rd Qu.:0.9000 3rd Qu.:25.00
## Max. :10.000 Max. :0.9900 Max. :30.00
##
## Deep.sleep.percentage Light.sleep.percentage Awakenings
## Min. :18.00 Min. : 7.00 Min. :0.000
## 1st Qu.:48.25 1st Qu.:15.00 1st Qu.:1.000
## Median :58.00 Median :18.00 Median :1.000
## Mean :52.82 Mean :24.56 Mean :1.641
## 3rd Qu.:63.00 3rd Qu.:32.50 3rd Qu.:3.000
## Max. :75.00 Max. :63.00 Max. :4.000
## NA's :20
## Caffeine.consumption Alcohol.consumption Smoking.status Exercise.frequency
## Min. : 0.00 Min. :0.000 Length:452 Min. :0.000
## 1st Qu.: 0.00 1st Qu.:0.000 Class :character 1st Qu.:0.000
## Median : 25.00 Median :0.000 Mode :character Median :2.000
## Mean : 23.65 Mean :1.174 Mean :1.791
## 3rd Qu.: 50.00 3rd Qu.:2.000 3rd Qu.:3.000
## Max. :200.00 Max. :5.000 Max. :5.000
## NA's :25 NA's :14 NA's :6
Running these lines of code reveals a general informational overview of our data. The result summary reveals that there are missing values in our data, but don’t worry! This is quite a common occurrence for data sets—whether they are large or small.
In a sleep-study setting, we’re likely to encounter missing or inaccurate data as a consequence of technical issues and human error. We can choose either to omit or to impute missing values, depending on their proportion relative to the entire data set and the distribution of the affected predictors. Let’s explore their rough distribution with a missing-values plot!
# Missing value plot from 'finalfit' package
sleep %>%
missing_plot()
Notice that only 4 of the predictors have missing values. Let’s run this next chunk to figure out exactly how many missing values we’re dealing with:
colSums(is.na(sleep))
## ID Age Gender
## 0 0 0
## Bedtime Wakeup.time Sleep.duration
## 0 0 0
## Sleep.efficiency REM.sleep.percentage Deep.sleep.percentage
## 0 0 0
## Light.sleep.percentage Awakenings Caffeine.consumption
## 0 20 25
## Alcohol.consumption Smoking.status Exercise.frequency
## 14 0 6
Awakenings, Caffeine.consumption,
Alcohol.consumption, and Exercise.frequency
have 20, 25, 14, and 6 missing values, respectively… Not too many after
all! However, these predictors with missing values aren’t appropriate
for linear imputation because their relationship with other predictor
variables won’t always follow a linear trend. Mean or median imputation
is inappropriate as well because many of the values aren’t missing
completely at random and the non-missing observations in
Awakenings have a median of zero.
Therefore, let’s opt to remove the missing values and update the data to reflect this adjustment!
sleep <- sleep %>%
drop_na()
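For contrast, here is a minimal base-R sketch of the median imputation we decided against (the helper function below is hypothetical and never used in this project):

```r
# Hypothetical helper: fill missing entries with the column median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Since the observed values of Awakenings have a median of zero, this
# approach would have filled every missing Awakenings entry with 0 --
# one reason we dropped the rows instead
impute_median(c(0, 3, NA, 1, 0))  # -> 0.0 3.0 0.5 1.0 0.0
```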
Now, we’re interested in understanding the scale of the updated data set.
# Call dim() to see # of rows & columns
dim(sleep)
## [1] 388 15
# Viewing the variable names
names(sleep)
## [1] "ID" "Age" "Gender"
## [4] "Bedtime" "Wakeup.time" "Sleep.duration"
## [7] "Sleep.efficiency" "REM.sleep.percentage" "Deep.sleep.percentage"
## [10] "Light.sleep.percentage" "Awakenings" "Caffeine.consumption"
## [13] "Alcohol.consumption" "Smoking.status" "Exercise.frequency"
Upon removing missing values, our data set contains 388 rows of observations and 15 columns: the outcome Sleep.efficiency, an ID column, and 13 predictor variables.
We will perform the following cleaning procedures to remove
unnecessary variables and convert any categorical predictor variables
into factors. Let’s remove ID, which is simply a unique
number assigned to each sleep study participant for identification
purposes.
# Removing the variable "ID"
sleep <- sleep[, !(names(sleep) %in% "ID")]
# Converting 'Gender' 'Smoking.status' to factors
sleep$Gender <- as.factor(sleep$Gender)
sleep$Smoking.status <- as.factor(sleep$Smoking.status)
Of the 388 observed participants, ages range from 9 to 69 years old. Age plays a large role in human sleep patterns, as well as in certain lifestyle choices. Let’s create a new predictor from Age to categorize each person into one of four age demographics: child, youth, adult, and senior! The age-demographic cutoffs are sourced from the National Institutes of Health (NIH).
# Specifying the age breaks
age.cutoff.criteria <- c(0, 12, 18, 59, Inf)
# Creating new predictor specifying age group
# A factor with 4 levels
sleep$Age.group <- cut(sleep$Age, breaks = age.cutoff.criteria,
labels = c("Child", "Youth", "Adult", "Senior"),
include.lowest = TRUE)
sleep %>%
ggplot(aes(x=Age.group)) +
geom_bar()
The variables Bedtime and Wakeup.time are
currently character strings, both formatted as the date (YYYY-MM-DD)
followed by the time (HR:MIN:SEC). Our focus is the bedtime and wakeup
times, yet the current format is difficult to implement in predictive
and classification settings. Thus, we will drop the date portion of the strings, which isn’t useful for our purposes, and convert both predictors to a numeric hour-of-day.
# Date-time conversion allows us to manipulate Bedtime & Wakeup.time
sleep$Bedtime <- as.POSIXct(sleep$Bedtime,
format = "%Y-%m-%d %H:%M:%S")
sleep$Wakeup.time <- as.POSIXct(sleep$Wakeup.time,
format = "%Y-%m-%d %H:%M:%S")
# Extract hour() & minute() -- from 'lubridate' package
Bedtime.hour <- hour(sleep$Bedtime)
Bedtime.mins <- minute(sleep$Bedtime)
Wakeup.hour <- hour(sleep$Wakeup.time)
Wakeup.mins <- minute(sleep$Wakeup.time)
# Update Bedtime, Wakeup.time to a numeric form (hours)
sleep$Bedtime <- Bedtime.hour + (Bedtime.mins / 60)
sleep$Wakeup.time <- Wakeup.hour + (Wakeup.mins / 60)
Before delving into each of the predictors, let’s perform a quick sanity check:
str(sleep)
## 'data.frame': 388 obs. of 15 variables:
## $ Age : int 65 69 40 40 57 27 53 41 11 50 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 1 1 2 1 2 1 1 2 ...
## $ Bedtime : num 1 2 21.5 2.5 1 21 0.5 2.5 1 0.5 ...
## $ Wakeup.time : num 7 9 5.5 8.5 9 3 10.5 8.5 10 8.5 ...
## $ Sleep.duration : num 6 7 8 6 8 6 10 6 9 8 ...
## $ Sleep.efficiency : num 0.88 0.66 0.89 0.51 0.76 0.54 0.9 0.79 0.55 0.92 ...
## $ REM.sleep.percentage : int 18 19 20 23 27 28 28 28 18 23 ...
## $ Deep.sleep.percentage : int 70 28 70 25 55 25 52 55 37 57 ...
## $ Light.sleep.percentage: int 12 53 10 52 18 47 20 17 45 20 ...
## $ Awakenings : num 0 3 1 3 3 2 0 3 4 1 ...
## $ Caffeine.consumption : num 0 0 0 50 0 50 50 50 0 50 ...
## $ Alcohol.consumption : num 0 3 0 5 3 0 0 0 0 0 ...
## $ Smoking.status : Factor w/ 2 levels "No","Yes": 2 2 1 2 1 2 2 1 1 2 ...
## $ Exercise.frequency : num 3 3 3 1 3 1 3 1 0 3 ...
## $ Age.group : Factor w/ 4 levels "Child","Youth",..: 4 4 3 3 3 3 3 3 1 3 ...
Nice! We can confirm that all the variables are prepared in the correct format for our recipe.
Upon analyzing and tidying our raw data, we have removed unnecessary variables and converted our predictors into appropriate formats to yield a clearer analysis. Now we are ready for our visual exploratory analysis! Here are the predictor variables we plan to use in our model recipe to predict sleep efficiency scores, each followed by a brief description:
- Age : Age of the test subject, as an integer
- Gender : Gender of the test subject (male/female), converted to a factor
- Bedtime : Time at which the subject goes to bed each night, converted from a character string to a numeric hour ranging over [0,23]
- Wakeup.time : Time at which the subject wakes up each morning, converted from a character string to a numeric hour ranging over [3,12]
- Sleep.duration : The total number of hours the subject slept, as a double
- REM.sleep.percentage : Percentage of total sleep time spent in REM sleep, as an integer ranging over [0,100]
- Deep.sleep.percentage : Percentage of total sleep time spent in deep sleep, as an integer ranging over [0,100]
- Light.sleep.percentage : Percentage of total sleep time spent in light sleep, as an integer ranging over [0,100]
- Awakenings : The average number of times the subject wakes up during the night, as a double
- Caffeine.consumption : Amount of caffeine (in milligrams) consumed in the 24 hours prior to bedtime
- Alcohol.consumption : Amount of alcohol (in ounces) consumed in the 24 hours prior to bedtime
- Smoking.status : Whether or not the subject smokes (Yes/No), converted to a factor
- Exercise.frequency : The average number of times the subject exercises each week, as a double
- Age.group : A factor with 4 levels, categorizing each participant into one of four age demographics
To gain some understanding between the response variable and its predictors, we should generate a simple distribution plot and a correlation matrix to identify any possible relationships among variables. Then, we’ll create several visualizations to observe the impact of these predictors of interest on our outcome variable.
Let’s first explore the distribution of our outcome variable
Sleep.efficiency.
sleep %>%
ggplot(aes(x=Sleep.efficiency)) +
geom_bar(aes(fill = Age.group)) +
labs(x = "Sleep Efficiency Score", y = "# of Occurrences Per Score",
title = "Distribution of Sleep Efficiency Scores Per Age Group") +
theme_bw()
Sleep efficiency scores can lie anywhere from 0 to 1. In our data
set, we observe that Sleep.efficiency ranges from 0.5 to
1.0, inclusive. This makes intuitive sense: a score of zero would imply that the subject did not sleep at all. The bar chart’s peak shows that the most frequent score is 0.9. Note that an overwhelming majority of our observations fall into the adult age group, so it may be wise not to introduce this categorical variable into our model setup…
Next, let’s create a correlation matrix of all our numeric variables
which we’ll use to produce a correlation heat map. This map will allow
us to visually detect correlations among our predictors and the outcome,
Sleep.efficiency.
# Extract numeric predictors
sleep.numer <- sleep %>%
select(where(is.numeric))
# Sleep correlation plot
sleep.heatplot <- corrplot(cor(sleep.numer), order = 'AOE',
addCoef.col = 1, number.cex = 0.3)
Surprising… There aren’t as many relationships as expected!
First, it’s obvious that
Light.sleep.percentage-Deep.sleep.percentage
and Bedtime-Wakeup.time are strongly negatively correlated, since they are opposite (or complementary) measures of each
other. The pairs have correlation scores of -0.98 and -0.77,
respectively, which will most likely generate the issue of
multi-collinearity in our model. In addition, I realized that
Age.group may be useful for visual EDA, but will probably
introduce redundancy if used in combination with the Age
predictor. Let’s resolve this problem by not considering one of the two
conflicting variables from each pair, as well as Age.group,
in our recipe formulation.
Sleep.duration and Wakeup.time also exhibit
a noteworthy positive correlation since waking up at a later hour
increases the duration of sleep. It seems that
Alcohol.consumption has a mild positive correlation with
Light.sleep.percentage and a mild negative correlation with
Deep.sleep.percentage, potentially because alcohol is a
depressant known to disrupt normal sleep cycles. I suspect that
Wakeup.time and Exercise.frequency share a
moderately negative correlation because individuals who exercise more
often may tend to wake up earlier in the day to have enough time for a
workout session.
I was shocked that age, alcohol & caffeine consumption, exercise frequency, and light & deep sleep percentage had little to no overall correlation with Sleep.duration or REM.sleep.percentage, yet exhibited their predicted relationships with Sleep.efficiency.
The heat-map plot exhibits a -0.02 correlation score between
Sleep.duration and Sleep.efficiency. It seems
pretty intuitive to me that sleeping longer would have some positive
relationship to overall sleep efficiency, so I was surprised that the
score wasn’t even slightly positive. Let’s further investigate the
distribution of the two variables with a simple barplot.
# Simple barplot of grouped durations against sleep scores
sleep %>%
ggplot(aes(x = as.factor(Sleep.duration), y = Sleep.efficiency)) +
geom_bar(stat = "summary", fun = "mean") +
labs(title = "Average Sleep Efficiency by Sleep Duration",
x = "Sleep Duration", y = "Sleep Efficiency Score") +
theme_minimal()
The roughly equal bar heights suggest that across various sleep durations, the corresponding sleep efficiency scores don’t vary much. I’m surprised, because I had always assumed that more hours of sleep would correlate positively with better quality of sleep.
sleep %>%
ggplot(aes(x = Sleep.efficiency, y = Gender, fill = Smoking.status)) +
geom_boxplot() +
labs(title = "Boxplot of Sleep Efficiency vs Gender & Smoking Status",
x = "Sleep Efficiency Score",
y = "Gender") +
scale_fill_manual(name = "Smoking Status",
values = c(No = "#FF9966", Yes = "#33CC99")) +
theme_bw()
From the above grouped boxplot, we initially see that the median sleep efficiency of non-smokers is higher than that of smokers regardless of gender. In addition, the scores are more clustered for non-smokers and more spread out among smokers. For women, the median sleep score of smokers is lower than the first quartile of non-smokers. If we take a look at the maxima (far right end of the whiskers), we see that the highest observed sleep efficiency scores belong to non-smokers, both male and female. Among female smokers, the interquartile range extends widely, suggesting that the impact of smoking on the middle half of this sample is not very consistent.
sleep %>%
# Points & approximate line for 'Alcohol.consumption'
ggplot(aes(x = Alcohol.consumption, y = Sleep.efficiency)) +
geom_jitter(aes(color = "Alcohol"), size = 0.4, width = 0.5) +
geom_smooth(aes(color = "Alcohol"), method = "lm", show.legend = TRUE) +
# Points & approximate line for 'Exercise.frequency'
geom_jitter(aes(x = Exercise.frequency, color = "Exercise"),
size = 0.4, width = 0.5) +
geom_smooth(aes(x = Exercise.frequency, color = "Exercise"),
method = "lm", show.legend = TRUE) +
labs(x = "Alcohol Consumption (oz/day) & Weekly Exercise Frequency",
y = "Sleep Efficiency Score",
title = "Sleep Efficiency vs Alcohol Consumption & Exercise") +
scale_color_manual(name = "Predictors",
values = c(Alcohol = "firebrick2",
Exercise = "springgreen4"))
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
After removing the observations with missing values, it becomes
apparent that Alcohol.consumption and
Exercise.frequency exhibit opposing trends when plotted
against the outcome. Although both predictors’ observation points are
quite randomly scattered, we used the geom_smooth() function to approximate
slight linear patterns. As expected, increased daily alcohol consumption
corresponds to lower sleep efficiency while more frequent exercise
corresponds to higher sleep efficiency scores.
Now that we have explored potential relationships between our variables and gauged which predictors might have the most impact on sleep efficiency scores, we can finally begin incrementally setting up our predictive model! This process starts with splitting our data into random train-test sets, then developing a universal recipe, and finally cross-validating our results across k folds.
Before moving forward, we must split our data into two subsets: the training set and the testing set. A 70/30 train-test ratio is pretty standard, so let’s use a similar proportion to split our sleep data. It’s essential to split the data this way so we can honestly evaluate the model instead of unknowingly over-fitting it. We use the training set first because it teaches machine learning models to pick up on important patterns and relationships within the data. Usually, the model with the lowest root-mean-squared error (RMSE) is considered the “best” model. Once we determine a best model, we fit it to the testing set (left untouched until this later step) to evaluate how well it performs on an entirely new collection of data.
In order to maintain and reproduce our data throughout the project,
I’ll go ahead and set a random seed. Next, I will create my train-test
split, stratifying on the outcome variable to ensure that each subset
contains an equal, well-representative distribution of
Sleep.efficiency.
set.seed(123)
# Split the data, stratifying on `Sleep.efficiency`
sleep_split <- initial_split(sleep, prop = 0.75, strata = Sleep.efficiency)
sleep_train <- training(sleep_split)
sleep_test <- testing(sleep_split)
I ultimately decided on a 75/25 train-test split to reserve even more data to train my model on, while also leaving enough observations to test on. Before starting on the recipe, let’s run the following code chunks as a sanity check to verify we split the data correctly…
count(sleep_train)/nrow(sleep)
## n
## 1 0.7474227
count(sleep_test)/nrow(sleep)
## n
## 1 0.2525773
Perfect! The training set contains approximately 75% of our total sample, and the testing set contains the remaining 25%.
Over the course of our analysis, we will continue using the same predictors, model conditions, and response variable. Therefore, we will write one universal recipe to preprocess the data, which we’ll apply within various model setups and training procedures. Each of the models we choose will utilize this recipe through the specific methods associated with its model type.
We will be using 11 out of 14 total predictors that we described
earlier: Age, Gender, Bedtime,
Sleep.duration, REM.sleep.percentage,
Deep.sleep.percentage, Awakenings,
Caffeine.consumption, Alcohol.consumption,
Smoking.status, and Exercise.frequency.
Although they provided helpful insights during our EDA, note that we
will not consider the predictors Light.sleep.percentage,
Wakeup.time, and Age.group for the sake of
avoiding redundancy and risk of multi-collinearity.
We already dealt with missingness way earlier! To recap, we concluded
it was more appropriate to remove the missing values rather than impute
them based on the nature/distribution of Awakenings,
Caffeine.consumption, Alcohol.consumption, and
Exercise.frequency.
We’ll then dummy encode all our categorical variables:
Gender and Smoking.status. Then, let’s make
sure to normalize our data by centering and scaling. This step will
enhance the interpretability and performance of our models.
sleep_recipe <- recipe(Sleep.efficiency ~ Age + Gender + Bedtime +
Sleep.duration + REM.sleep.percentage +
Deep.sleep.percentage + Awakenings +
Caffeine.consumption + Alcohol.consumption +
Smoking.status + Exercise.frequency,
data = sleep_train) %>%
# STEP 1: Dummy-encode nominal predictors
step_dummy(Gender, Smoking.status) %>%
# STEP 2: Scale all predictors
step_scale(all_predictors()) %>%
# STEP 3: Center all predictors
step_center(all_predictors())
To better evaluate our model’s accuracy and prevent potential over-fitting, we can turn to K-fold cross-validation, which is more rigorous than a single train-test split.
This method randomly splits the training data into k equally sized subsets called “folds”. In each iteration, R reserves one fold as the validation set and assigns the remaining k-1 folds as the training set. It then fits the specified model to the training folds and tests it on the held-out fold.
The process repeats for each fold (k times in total), so that each fold acts as the validation set exactly once. Lastly, model performance is evaluated by averaging the metric values gathered from the k folds (we’ll use RMSE, though other metrics work as well).
Let’s choose k=10: 10 folds is a common number to choose in a machine learning setting, and is typically rigorous enough to yield a model testing accuracy with low bias and modest variance. We will train the model on all but one of the folds, validating our model on that last remaining fold. For 10 folds, we repeat the process 10 times.
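The mechanism above can be sketched in base R on a toy data set with lm() (vfold_cv() and the tuning functions handle all of this for us later; the data and model here are illustrative only):

```r
set.seed(1)
toy <- data.frame(x = 1:50, y = 1:50 + rnorm(50))    # made-up data
k <- 10
fold_id <- sample(rep(1:k, length.out = nrow(toy)))  # random fold labels

fold_rmse <- sapply(1:k, function(i) {
  fit  <- lm(y ~ x, data = toy[fold_id != i, ])      # train on k-1 folds
  pred <- predict(fit, newdata = toy[fold_id == i, ])
  sqrt(mean((toy$y[fold_id == i] - pred)^2))         # RMSE on held-out fold
})
mean(fold_rmse)  # cross-validated estimate of performance
```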
Before we forget, we should remember to stratify on our outcome
variable, Sleep.efficiency, to prevent data imbalances
across each of the ten folds.
# Creating our folds
sleep_folds <- vfold_cv(sleep_train, v = 10, strata = Sleep.efficiency)
After all this tedious preparation, we’re finally ready to build our models!

## Performance Metric

We will use the root mean squared error (RMSE) as our performance metric because it’s a popular choice for regression tasks such as this project.
The RMSE measures the average deviation between a statistical model’s predictions and the true values. In a mathematical sense, it captures essentially the standard deviation of the residuals (the distances between actual data points and the regression line). RMSE values are expressed in the same units as the outcome variable and can range anywhere from zero—indicating a perfect model fit—to infinity. Thus, the lower the value, the better the fit.
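As a quick illustration, the RMSE of a handful of made-up predictions can be computed directly (yardstick’s rmse() does the same thing inside tidymodels; the vectors below are not outputs from our models):

```r
# Illustrative values only
actual    <- c(0.88, 0.66, 0.89, 0.51)
predicted <- c(0.85, 0.70, 0.90, 0.55)

rmse_manual <- sqrt(mean((actual - predicted)^2))  # root of mean squared residual
round(rmse_manual, 4)                              # 0.0324
```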
Note that normalizing the predictors within sleep_recipe matters most for distance- and penalty-based models such as KNN and ridge regression; the RMSE itself is computed on the untouched outcome, so it serves as a universal, easily interpretable metric for all our models!
We will create and test out several different techniques that we’ve
learned throughout the course, PSTAT131 (Statistical Machine Learning).
Each model will follow the same universal sleep_recipe that
we created earlier. Once we fit and evaluate our 5 different models,
we’ll choose the best 2 models based on lowest RMSE. Then we’ll analyze
those two even further!
To fit each of our five models, we will follow a fairly straightforward and consistent series of steps detailed below:
1) Set up each model: Specify the type of model you are going to fit, the tuning parameters (if applicable), the mode (always “regression” for this project), and finally the engine that the model is sourced from.
# LINEAR REGRESSION
lm_model <- linear_reg() %>%
set_engine("lm")
# RIDGE REGRESSION
ridge_model <- linear_reg(mixture = 0,
penalty = tune()) %>%
set_mode("regression") %>%
set_engine("glmnet")
# K-NEAREST NEIGHBORS
knn_model <- nearest_neighbor(neighbors = tune()) %>%
set_mode("regression") %>%
set_engine("kknn")
# RANDOM FOREST
rf_model <- rand_forest(trees = tune(),
mtry = tune(),
min_n = tune()) %>%
set_mode("regression") %>%
set_engine("ranger", importance = "impurity")
# GRADIENT BOOSTED TREE
boost_tree_model <- boost_tree(trees = tune(),
learn_rate = tune(),
min_n = tune()) %>%
set_engine("xgboost") %>%
set_mode("regression")
2) Set up each workflow: Set up a workflow for each
model by adding the model and our preexisting
sleep_recipe.
# LINEAR REGRESSION
lm_wf <- workflow() %>%
add_model(lm_model) %>%
add_recipe(sleep_recipe)
# RIDGE REGRESSION
ridge_wf <- workflow() %>%
add_model(ridge_model) %>%
add_recipe(sleep_recipe)
# K-NEAREST NEIGHBORS
knn_wf <- workflow() %>%
add_model(knn_model) %>%
add_recipe(sleep_recipe)
# RANDOM FOREST
rf_wf <- workflow() %>%
add_model(rf_model) %>%
add_recipe(sleep_recipe)
# GRADIENT BOOSTED TREE
boost_tree_wf <- workflow() %>%
add_model(boost_tree_model) %>%
add_recipe(sleep_recipe)
3) Create tuning grids: For each hyper-parameter we want to tune, specify their ranges and the number of levels. This step will require a few rounds of testing and adjusting before we can strike an appropriate combination of tuning parameters to yield the best fit for each model. Some models (such as linear regression) don’t have any tuning parameters that we need to specify or create a grid for.
# LINEAR REGRESSION -- No tuning parameters
# RIDGE REGRESSION
# Note: penalty() operates on a log10 scale by default, so this range
# spans penalties from 10^0.001 to 10^1 (roughly 1 to 10)
penalty_grid <- grid_regular(penalty(range = c(0.001,1)), levels = 50)
# K-NEAREST NEIGHBORS
knn_grid <- grid_regular(neighbors(range = c(1, 20)), levels = 5)
# RANDOM FOREST
rf_grid <- grid_regular(mtry(range = c(1,9)),
trees(range = c(50, 500)),
min_n(range = c(5, 20)),
levels = 8)
# GRADIENT BOOSTED TREE
boosted_grid <- grid_regular(trees(range = c(10, 500)),
learn_rate(range = c(0.01, 0.1),
trans = identity_trans()),
min_n(range = c(1, 20)),
levels = 5)
4) Fitting models: Let’s fit our models to our folded sleep data. To do this, we have to specify the workflow, “K-fold” resamples, and the tuning grid.
# LINEAR REGRESSION -- No tuning
lm_fit <- fit_resamples(lm_wf, resamples = sleep_folds)
# RIDGE REGRESSION
ridge_tune <- tune_grid(
ridge_wf,
resamples = sleep_folds,
grid = penalty_grid
)
# K-NEAREST NEIGHBORS
knn_tune <- tune_grid(
knn_wf,
resamples = sleep_folds,
grid = knn_grid
)
# RANDOM FOREST
rf_tune <- tune_grid(
rf_wf,
resamples = sleep_folds,
grid = rf_grid
)
# GRADIENT BOOSTED TREE
boosted_tune <- tune_grid(
boost_tree_wf,
resamples = sleep_folds,
grid = boosted_grid
)
5) Saving RDS Files: Let’s save each of our tuned models to an RDS file to avoid long refitting times whenever we rerun our work.
# RIDGE REGRESSION
write_rds(ridge_tune, file = "ridge.rds")
# K-NEAREST NEIGHBORS
write_rds(knn_tune, file = "knn.rds")
# RANDOM FOREST
write_rds(rf_tune, file = "rf.rds")
# GRADIENT BOOSTED TREE
write_rds(boosted_tune, file = "boost.rds")
6) Loading RDS Files: Now, we can load the saved RDS files back into our workspace without the long wait!
# RIDGE REGRESSION
tuned_ridge <- read_rds(file = "ridge.rds")
# K-NEAREST NEIGHBORS
tuned_knn <- read_rds(file = "knn.rds")
# RANDOM FOREST
tuned_rf <- read_rds(file = "rf.rds")
# GRADIENT BOOSTED TREE
tuned_boost <- read_rds(file = "boost.rds")
7) Collecting metrics from tuned models: To choose a tuning configuration for each model, we can use
select_by_one_std_err(), which picks the simplest candidate
whose RMSE falls within one standard error of the lowest RMSE.
# LINEAR REGRESSION
best_lm <- show_best(lm_fit, metric = "rmse")
# RIDGE REGRESSION
best_ridge <- select_by_one_std_err(tuned_ridge,
                                    metric = "rmse",
                                    penalty)
# K-NEAREST NEIGHBORS
best_knn <- select_by_one_std_err(tuned_knn,
                                  metric = "rmse",
                                  neighbors)
# RANDOM FOREST
best_rf <- select_by_one_std_err(tuned_rf,
                                 metric = "rmse",
                                 mtry, trees, min_n)
# GRADIENT BOOSTED TREE
best_boost <- select_by_one_std_err(tuned_boost,
                                    metric = "rmse",
                                    trees, learn_rate, min_n)
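The rule behind select_by_one_std_err() can be sketched in base R: find the candidate with the lowest mean RMSE, then choose the simplest candidate whose mean RMSE falls within one standard error of that minimum (for ridge, "simplest" means the most regularized, i.e. the largest penalty). The numbers below are made up for illustration:

```r
# Hypothetical tuning results: one row per penalty candidate
results <- data.frame(penalty = c(0.001, 0.01, 0.1, 1),
                      mean    = c(0.050, 0.051, 0.052, 0.060),  # CV RMSE
                      std_err = c(0.002, 0.002, 0.002, 0.003))

# One-SE rule: threshold = best mean RMSE + its standard error
best      <- which.min(results$mean)
threshold <- results$mean[best] + results$std_err[best]

# Keep candidates within the threshold, then take the simplest
# (here: the largest penalty, i.e. the most regularized model)
eligible <- results[results$mean <= threshold, ]
eligible[which.max(eligible$penalty), ]
```

This trades a sliver of raw performance for a simpler, more generalizable model.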
Now that we've used select_by_one_std_err() to pick
the best tuning configuration for each of our five models, let's compare
the results. We can return each RMSE in a nicely formatted tibble,
arranged from best to worst performance.
# Tibble with all the models and their RMSEs
final_comparison_table <- tibble(Model = c("Linear Regression",
                                           "Ridge Regression",
                                           "K Nearest Neighbors",
                                           "Random Forest",
                                           "Boosted Trees"),
                                 RMSE = c(best_lm$mean, best_ridge$mean,
                                          best_knn$mean, best_rf$mean,
                                          best_boost$mean))
# Displaying in order of lowest to highest RMSE
final_comparison_table <- final_comparison_table %>%
arrange(RMSE)
# Returning the tibble results
final_comparison_table
## # A tibble: 5 × 2
## Model RMSE
## <chr> <dbl>
## 1 Boosted Trees 0.0496
## 2 Random Forest 0.0498
## 3 Linear Regression 0.0629
## 4 K Nearest Neighbors 0.0662
## 5 Ridge Regression 0.117
Our next task is to visualize the RMSE results as a bar plot.
# Creating data frame of all models & their RMSE
five_models <- data.frame(Model = c("Linear Regression",
                                    "Ridge Regression",
                                    "K Nearest Neighbors",
                                    "Random Forest",
                                    "Boosted Trees"),
                          RMSE = c(best_lm$mean, best_ridge$mean,
                                   best_knn$mean, best_rf$mean,
                                   best_boost$mean))
# Bar-plot of the RMSE's
five_models %>%
ggplot(aes(x = Model, y = RMSE)) +
geom_bar(aes(fill = Model), stat = "identity") +
labs(title = "Evaluating Model Performance Via RMSE") +
theme_classic()
Based on our RMSE tibble and bar plot, Random Forest and Gradient Boosted Trees were the two models with the lowest RMSE, and thus, by our criterion, performed best on the cross-validation data. We should also note that Ridge Regression performed the worst by far. In our case, the more complex models outperformed simpler ones such as linear regression, which suggests that the relationships in our data aren't fully linear.
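One likely contributor to Ridge Regression's poor showing is worth flagging: dials' penalty() tunes on the log10 scale by default, so the range c(0.001, 1) we passed earlier spans 10^0.001 ≈ 1.002 up to 10^1 = 10, rather than penalties of 0.001 to 1; a range of c(-3, 0) would have covered that intended interval. A base-R sketch of the values such a grid actually produces, assuming the default log10 transform:

```r
# With the default log10 transform, grid_regular(penalty(range = c(0.001, 1)))
# spaces 50 values evenly on the log10 scale, then back-transforms them
log10_levels <- seq(0.001, 1, length.out = 50)
penalty_vals <- 10^log10_levels

range(penalty_vals)  # ~1.002 to 10 -- far larger penalties than 0.001 to 1
```

Penalties that large shrink the coefficients aggressively, which would push the ridge model toward under-fitting and could inflate its RMSE.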
Auto-plots can help us better understand the impact that tuning parameters have on overall model performance. Let's continue using RMSE as our metric of choice to visually investigate our two best models.
autoplot(tuned_rf, metric = "rmse")
For Random Forest, we tuned mtry over a range of 1 to 9 predictors to be randomly sampled at each split. Although there are 11 total predictors, we capped the range at 9 to reduce the risk of over-fitting. Randomly picking a subset of predictors at each split helps the model capture different relationships between predictors and the outcome. We specified the range of trees as 50-500, but the plot above shows that the number of trees has little to no discernible impact on performance. It's a little hard to see at first, but a minimal node size of 5 seems to correspond with slightly lower RMSE values. Across all 8 levels, the plots show that the greater the number of randomly selected predictors, the smaller the RMSE.
autoplot(tuned_boost, metric = "rmse")
The RMSE decreases substantially as the number of trees grows toward a
threshold of approximately 100. Before we hit this threshold, a higher
learning rate yields better model performance; once there are ~100
trees, the model performs noticeably better regardless. We tuned the
learning rate over a range of 0.01 to 0.1 because too large a value
leads to a trade-off: the model learns faster but produces overly
specific, over-fitted results. Our goal is to strike a balance between
model complexity and generalization. The minimal node size appears to
have little impact on the RMSE, though upon closer inspection a
min_n of 20 performs slightly better than the other
sizes.
Great! We’ve now gathered that Random Forest performed the best out of our 5 total models. So, let’s investigate it more.
best_rf
## # A tibble: 1 × 11
## mtry trees min_n .metric .estimator mean n std_err .config .best
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr> <dbl>
## 1 3 50 5 rmse standard 0.0498 10 0.00264 Preprocessor… 0.0476
## # ℹ 1 more variable: .bound <dbl>
The parameters that yielded the best Random Forest model are
mtry = 3, trees = 50, and
min_n = 5. It produced the lowest cross-validated RMSE,
0.04911741.
Now that we have our best model, let’s go ahead and re-fit it to the training set. Then, we’ll be able to use it with the testing data.
# Selecting best model
best_rf_train <- select_best(tuned_rf, metric = "rmse")
# Finalizing the workflow with the best_rf parameters
rf_final_workflow_train <- finalize_workflow(rf_wf, best_rf)
# Fitting to training set
rf_final_fit_train <- fit(rf_final_workflow_train, data = sleep_train)
# Saving to RDS file
write_rds(rf_final_fit_train, file = "rf_final_train.rds")
Lastly, it's time to test our model on new data. Let's use it to
predict on a set of completely unseen data, sleep_test!
# Reloading the fitted training data
rf_final_fit_train <- read_rds(file = "rf_final_train.rds")
# Making a tibble of predicted and actual values
best_model_tibble <- predict(rf_final_fit_train,
                             new_data = sleep_test %>%
                               select(-Sleep.efficiency))
best_model_tibble <- bind_cols(best_model_tibble,
                               sleep_test %>% select(Sleep.efficiency))
# Saving to RDS file
write_rds(best_model_tibble, file = "final_model.rds")
# Reloading the final model
best_model_tibble <- read_rds(file = "final_model.rds")
# Naming RMSE as official metric
sleep_metric <- metric_set(rmse)
# Gathering RMSE from testing set
best_model_tibble_metrics <- sleep_metric(best_model_tibble,
                                          truth = Sleep.efficiency,
                                          estimate = .pred)
best_model_tibble_metrics
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 0.0496
Interesting! Our Random Forest model actually performed slightly better on the training set's cross-validation folds than on the testing set, though the difference between the RMSEs was extremely small: 0.04911741 for training versus 0.04936463 for testing. As described in the beginning, RMSE is expressed in the original units of the outcome; in our case, that's a proportion that can range from 0 to 1. That being said, an average error of about 0.049 is pretty decent: not extremely small, but not large by any means. It's safe to say the model performed pretty well!
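As a reminder of what that number means, RMSE is just the square root of the average squared prediction error, reported in the outcome's own units. A tiny base-R check with made-up sleep efficiency values:

```r
# Hypothetical actual and predicted sleep efficiency scores
actual    <- c(0.90, 0.75, 0.85, 0.60)
predicted <- c(0.88, 0.80, 0.84, 0.66)

# The same computation yardstick's rmse() performs under the hood
rmse <- sqrt(mean((actual - predicted)^2))
round(rmse, 4)  # 0.0406
```

An RMSE of about 0.049 therefore means our predictions miss the true efficiency score by roughly five percentage points on average.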
Next, we should visually compare the actual values against our model’s predictions.
# Scatterplot of prediction V actual
best_model_tibble %>%
ggplot(aes(x = .pred, y = Sleep.efficiency)) +
geom_point(alpha = 0.5) +
geom_abline(lty = 3) +
coord_obs_pred() +
labs(title = "Predicted Values vs. Actual Values") +
theme_minimal()
The points form a vaguely straight line, although there are many outliers. The line isn't as tight or well-defined as anticipated, probably because we were only dealing with 388 total observations. Quite a few points stray from the line, which is surprising given our fairly low RMSE, but I think our model performed decently overall!
In conclusion, I gathered that the best model for predicting sleep efficiency scores is Random Forest. The Random Forest model is good at capturing more complex relationships such as determining the relative importance of each predictor versus the response variable. The RMSE we got was fair… It could have been smaller than 0.049, but it also didn’t perform horribly.
Both Random Forest and Boosted Trees performed better than our three other models. This result indicates that the response variable most likely follows a non-linear form that flexible, non-parametric methods capture better. In other words, sleep efficiency is determined by a non-linear combination of the predictor variables we included in our recipe, so it makes sense that Linear Regression did not yield a very low RMSE.
K-Nearest Neighbors provided an even worse estimate, potentially because the method is known to perform poorly in high-dimensional spaces with few observations to offset the dimensionality. We included 11 predictors across only 388 data points, which left the observations too spread apart for KNN to predict with high accuracy.
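This "curse of dimensionality" is easy to demonstrate: with a fixed number of points scattered uniformly in a unit hypercube, the distance to the nearest neighbor grows quickly with dimension, so the "nearest" neighbors KNN averages over stop being truly local. A small simulation sketch using our sample size of 388 (the data here is randomly generated, not our sleep set):

```r
set.seed(1)
# Average distance to the nearest neighbor among n uniform points in p dimensions
nn_dist <- function(p, n = 388) {
  x <- matrix(runif(n * p), nrow = n)
  d <- as.matrix(dist(x))   # pairwise Euclidean distances
  diag(d) <- Inf            # ignore each point's distance to itself
  mean(apply(d, 1, min))    # mean nearest-neighbor distance
}
c(dim_2 = nn_dist(2), dim_11 = nn_dist(11))
```

The 11-dimensional nearest-neighbor distances come out many times larger than the 2-dimensional ones, which is exactly the regime where KNN's averages become unreliable.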
The worst model by far was Ridge Regression because it assumes a linear pattern between the predictors and outcome. By nature, this model is simplistic and thus is at risk of under-fitting the data when the penalty is too high. There may have been some complex or multi-collinear relationships that we failed to capture in our EDA, and Ridge Regression doesn’t respond to this well.
In the future, we could adjust the formulation of the Random Forest
model to generate even more accurate predictions of sleep efficiency.
The RMSE ended up slightly higher when we fit the Random Forest model to
our testing set. This could reflect the 75/25 train-test split
we decided to use: maybe there were some outlying observations in
the 25% reserved for testing. Since 50 trees yielded the best
performance, our data probably isn't complex enough to need anywhere
near the 500 trees at the top of the tuned range. We could have
decreased the range of the trees parameter and explored its
exact effect more rigorously.
A potential question is whether we should have chosen to keep
Light.sleep.percentage as a predictor in our recipe, or if
it would’ve been wiser to convert Deep.sleep.percentage,
Light.sleep.percentage, and
REM.sleep.percentage into categorical variables with the
levels “yes” and “no”. Maybe we should have converted
Bedtime and Wakeup.time into factors
categorizing whether the time was early versus late. In addition, I
would like to explore whether the project would be more suitable as a
classification problem, which would require us to establish a few
cut-off criteria associated with different sleep efficiency percentages.
Then, interpreting the outcome variable's values might become more
straightforward.
If we were to continue enhancing this predictive model, it would be helpful to gather more observations and explore more predictors related to sleep patterns and general health, because so many factors contribute to the complex science of human sleep. Here are a few ideas I have: perceived stress levels, smartphone usage, general health, socioeconomic status, hormonal levels based on sex, medication usage, and recreational drug usage (especially marijuana). I think the last one would be especially intriguing, since marijuana is a very commonly used substance among college students, yet its usage is known to decrease the amount of time spent in REM sleep. A completely new project idea would be predicting one's perceived happiness based on sleep efficiency and all the lifestyle predictors above.
On the whole, this Statistical Machine Learning project taught me how to implement all of the methods we learned throughout the quarter and apply them to a topic that I've been personally interested in exploring. Although the process was quite time-consuming and challenging, I learned how to troubleshoot independently and gained a lot more confidence in approaching an individual project!
I sourced the sleep_efficiency.csv file for this project
from Kaggle, which you can easily access
here: https://www.kaggle.com/datasets/equilibriumm/sleep-efficiency/data.
The data was scraped from a sleep study conducted in Morocco by a group
of AI engineering students from ENSIAS, and uploaded by the user
EQUILIBRIUMM onto Kaggle for public usage. The original study aimed to
investigate the impact of certain lifestyle factors on sleep quality and
sleep patterns.